XNOR-Nets with SETs: Proposal for a binarised convolution processing elements with Single-Electron Transistors

Deep neural network (DNN) and Convolution neural network (CNN) algorithms have significantly increased the accuracies in cutting-edge large-scale image recognition and natural-language processing tasks. Generally, such neural nets are implemented on power-hungry GPUs, beyond the reach of low-power edge-devices. The binary neural nets have been proposed recently, where both the input activations and weights are constrained to \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$+$$\end{document}+ 1 and − 1 to address this challenge. Here in the present proof-of-concept study, we propose a simple class of mixed-signal circuits composed of single-electron devices and exploit the nonlinear Coulomb staircase phenomena to alleviate the challenges of binarised deep learning hardware accelerators. In particular, through SPICE modeling, we demonstrate the realisation of space-time-energy efficient XNOR-Accumulation (XAC) operation, reconfigurabilty of XAC circuit to perform 1D convolution and a busbar design to augment a contemporary accelerator. These nanoscale circuits could be readily fabricated and may potentially be deployed in low-power deep-learning systems.

www.nature.com/scientificreports/ from a simple design of SET-based circuit that show huge gains in space-time-energy resources. Then we proceed on to reconfigure the XAC circuit to carry out the 1D convolution by taking a simple test case of 4 × 4 kernel convolution with input activations. Though the reconfigurability saves the chip area, it has its own fair share of disadvantages as discussed in the relevant section. Thus, we provide a solution with the simple busbar circuit of two-terminal SETs and also show the integrability of busbar into a contemporary binary neural net accelerator for further acceleration. Figure 1 shows the typical Coulomb staircase behaviour of current-voltage characteristics for an one level singleelectron system 13 . The results are obtained for a simple one-level device at two gate voltages, Vg = 0 V and Vg = 0.15 V. The current-voltage curves are obtained by analytically solving the Master equation for a single energy state. The initial point of the problem is to implement AND operation using the non-linear Coulomb staircase of a single SET. We use this one-level toy model to briefly discuss about the implementation of AND. Here, the two inputs are encoded as voltages of source and gate respectively and the output is encoded as the current. AND gate could be realised by encoding input logical 0 and 1 as voltages 0 V and 0.15 V respectively and output logical 0 and 1 as 0 A and 1.2e−7 A respectively. For example, inputs 0 and 0 gives output logic 0 and similarly, inputs 1 and 1 give output logic 1.

Results
With the basic philosophy of the problem laid down using a simple one-level model, we proceed on to a detailed study of the problem with a more rigorous Master equation based SPICE model 18 . Figure 2a depicts the typical Coulomb-staircase behaviour of a single-electron transistor obtained from SPICE-level simulation at gate voltages of 0 V and 0.045 V. The operation of AND as described above is illustrated in Fig. 2 taking an example input instance of 1 and 1.The output current measured at 3.6 nA is the logical output 1 for the given inputs. The inputs 0 and 1 of the Boolean gate could be set at the Coulomb levels of choice depending on the drive current and energy trade-off required in the application-specific circuit design. The current levels also offer high robustness to input voltage noises due to the Coulomb blockade of electrons. Moreover, the sensing margin of outputs namely I(1)/I(0) has an ideal value of infinity, in principle.
In Binary Neural Nets, the input activations and weights are binarised to either + 1 or − 1 and therefore, n-bit floating point Multiply-Accumulate operations are transformed to XNOR-Accumulate 5 . XAC remains the core operation of the binary deep neural networks consuming the bulk of space, time and energy resources. XNOR operation is derived from the AND operation by adding a second SET and inverting the input voltage signals as implemented by Eq. 1 and shown in Fig. 3. XNOR and XOR operations are also obtained in the previous studies 19,20 using a similar input and output encoding scheme but with a more complicated multi-gate device design and fabrication process.

XNOR and accumulate (XAC). The accumulation is done by POPCOUNT instruction in digital systems
and consumes vast resources of space and time for big-length inputs. The XAC circuit that builds on the previous XNOR circuit is shown in Fig. 4a that POPCOUNT the 2-bit length input. The circuit uses the Kirchhoff 's law of current to POPCOUNT the input. Long Cheng et al. 21 experimentally demonstrated a 4-bit POPCOUNT accelerator using a memristive array that similarly operates on the addition of currents. For illustration of the POP-COUNT operation, consider two binarised inputs A = [1 1] and B = [1 1]. The first bits of A and B are fed into V1 and V2, while second bits are fed into V3 and V4. SETs numbered 1,2,3 and 4 perform XNOR operation on the vectors and the POPCOUNT is read by measuring the output current Ipop. In the present example, the input voltages are fixed at 0.045 V and the output current is measured at 7.2 nA which is encoded as integer 2 (Fig. 4).
Reconfigurable XAC circuit. The advantage of the proposed XAC circuit is that it could be reconfigured to carry out the 1D and 2D convolutions 7 for a parallel binarised CNNs. Here, we demonstrate the reconfigurability of XAC circuit by taking a example of performing 1D row convolution of 4 × 4 kernel with input activations.  www.nature.com/scientificreports/ reconfigured XAC unit, with the voltages fixed at 0.045 V and 0 V to represent integer 1 and 0 respectively. The output current, I 1D measured at 7.2 nA encode the 1D convolution value 2 as shown in Fig. 5. It should be noted that current should be converted to appropriate voltage level with an additional converter circuit before feeding inputs into the reconfigured XAC circuit 22 .
Busbar circuit. The re-configurable XAC circuit provides a significant savings in transistor consumption and the chip area. But it is achieved at the expense of complex three-terminal transistor fabrication, delayed computation and importantly, integrating the SET-XAC circuit with other contemporary deep neural hardware www.nature.com/scientificreports/ could be challenging. Therefore, a simpler busbar circuit consisting of two-terminal single-electron junctions is proposed to address this problem. The crossbar of silicon quantum dots has already been demonstrated 23 for a more complicated architecture in quantum computing. In particular, we demonstrate the 1D convolution operation of the busbar circuit by integrating into and augmenting a contemporary hardware accelerator, XNORBIN 24 . XNORBIN, a completely digital and tapeout ASIC achieved the second-best result of 100 TOP/s/W designed on a Global Foundries 65nm node.
Here, we use a simple two-element busbar to demonstrate our proof-of-concept as shown in Fig. 6a and the present design could be scaled to arbitrary length. POPCOUNT and full-adder units of Basic processing unit (BPU) of XNORBIN are replaced with two busbars and the outputs of BPU XNOR ( voltage scaled appropriately ) are fed into the inputs of busbar. Since the busbar element length is 2, the input and size of the kernel is restricted to 2-bit vector and 2 × 2 respectively for our proof-of-concept illustration. Consider the outputs of XNOR to be www.nature.com/scientificreports/ 0 and 1 in both the BPUs (BPU0 and BPU1). As shown in Fig. 6, the output current I POP0 ( POPCOUNT from BPU0 = 1 ) generated from two-terminal single-electron junctions 1 and 2 of first busbar is fed into current-tovoltage converter (concomitantly POP1 from BPU1 = 1 is also fed ) to generate the appropriate input voltages for the second busbar. The POPCOUNT inputs are loaded onto the two-terminal single-electron junctions 3 and 4 of second busbar to get 1D convolution output of value 2. In principle, the complex digital adder circuits of XNORBIN could be replaced by a simple analog busbars to further accelerate the deep neural convolution nets. For a methodical performance analysis of our convolution processing elements, a rigorous system-level performance is required accounting for interconnects, current-sensing amplifiers, current-controlled voltage sources and is beyond the scope of present work. Nevertheless, an attempt is made to reasonably compare the SET based computation and the digital baseline taking fundamental XNOR and Full-adder(FA) operations on a 2-bit 2 inputs as the relevant parameter and tabulated in Table 1. Here, the comparison is benchmarked against a fully digital circuit 25 taking only the design that performed best in Power-Delay-Product(PDP) metrics. All the digital circuits have been designed on a 65 nm TSMC CMOS process technology node. The proposed fullswing XNOR gate consists of 7 transistors with the best PDP metric of 52.9 aJ. The FA, with the best PDP metric of 241.1 aJ, is a 22-transistor configuration that implements a well-known four-transistor 2-1-MUX structure. SET-based inverters 26 are considered to drive inverted inputs into the SET-XAC unit and the worst-case performance metrics of the XAC is reported. The parameters of the remaining SETs are also assumed to have similar values as SET-inverters. The SET based unit shows 66% lesser transistor utilisation, nearly-equivalent speed of XAC operation and an almost 4-order magnitude reduction in power dissipation at 1 GHz clock frequency.

Discussion
In summary, we demonstrate a SPICE-modeled simple mixed-signal single-electron transistor based circuits which capitalises on Coulomb staircase current-voltage characteristics, to possibly accelerate the binary deep neural nets. The present circuit design and the results could be readily fabricated and realised, on the condition that the maturity of SET fabrication be able to tackle reliability issues and the current proposal may find potential application either independently or complementing other ASICs in future low-power deep-learning devices.

Methods
To obtain analytical solution of one-level model, the following physical parameters are used: Right and Left tunneling rates = 1 meV, Single state energy = 50 meV, Charging energy = 100 meV, Temperature = 30 K. SPICE model parameters are as follows: Circuit simulations were carried out using freeware LTSPICE (https:// www.linear.com/solutions/1066). Capacitance of all junctions is fixed at 0.7 aF. Resistance of source junction = 10 M , Resistance of drain junction = 0.1 M , Temperature = 2 K.

Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.